In [1]:
#https://github.com/Avik-Jain/100-Days-Of-ML-Code/blob/master/Code/Day2_Simple_Linear_Regression.md
In [13]:
# Step 1: Data Preprocessing
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
dataset = pd.read_csv('studentscores.csv')
X = dataset.iloc[:, :1].values  # feature column(s); the slice keeps X 2-D
Y = dataset.iloc[:, 1].values   # target column as a 1-D array
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=1/4, random_state=0)
In [14]:
print(X_train, '\n\n', Y_train)
In [15]:
print(X_test, '\n\n', Y_test)
In [16]:
plt.scatter(X_train, Y_train, color='red')
Out[16]:
In [17]:
# Step 2: Fitting Simple Linear Regression Model to the training set
from sklearn.linear_model import LinearRegression # Ordinary least squares Linear Regression
regressor = LinearRegression()
regressor.fit(X_train, Y_train)  # fit the linear model; fit() returns the estimator, so reassignment is unnecessary
In [19]:
regressor.score(X_train, Y_train) # Returns the coefficient of determination R^2 of the prediction.
Out[19]:
The best possible score is 1.0, and it can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of y, disregarding the input features, would get an R^2 score of 0.0.
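As a sketch of what score() computes, R^2 can be reproduced from its definition, R^2 = 1 - SS_res / SS_tot (the variable names below are illustrative):

y_hat = regressor.predict(X_train)
ss_res = np.sum((Y_train - y_hat) ** 2)           # residual sum of squares
ss_tot = np.sum((Y_train - Y_train.mean()) ** 2)  # total sum of squares
print(1 - ss_res / ss_tot)  # should match regressor.score(X_train, Y_train)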
In [21]:
# Estimated coefficients for the linear regression problem.
# If multiple targets are passed during the fit (y 2D), this is a 2D array of shape (n_targets, n_features),
# while if only one target is passed, this is a 1D array of length n_features.
regressor.coef_
Out[21]:
In [24]:
regressor.intercept_ # Independent term in the linear model.
Out[24]:
In [22]:
# Step 3: Predicting the Result
Y_pred = regressor.predict(X_test)
In [25]:
print(Y_pred)
In [26]:
# Step 4: Visualization
# Visualizing the training results
plt.scatter(X_train, Y_train, color='red')
plt.plot(X_train, regressor.predict(X_train), color='blue')
Out[26]:
In [27]:
# Visualizing the test results
plt.scatter(X_test, Y_test, color='red')
plt.plot(X_test, regressor.predict(X_test), color='blue')
Out[27]:
In [42]:
X_test
Out[42]:
In [43]:
Y_test
Out[43]:
In [12]:
regressor.score(X_train, Y_train)
Out[12]:
In [28]:
regressor.predict(X_train)
Out[28]:
In [40]:
regressor.predict(np.array([[2]])) # the value inside the double brackets is the x value on the fitted line; the return is the corresponding y
Out[40]:
In [41]:
regressor.predict(np.array([[5]]))
Out[41]:
In [45]:
regressor.predict(np.array([[1.5]])) # a value on the fitted line, not one of the training or test points
Out[45]:
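These single-point predictions are just the fitted line y = intercept + slope * x evaluated at the given x; a quick sanity check (a minimal sketch reusing the fitted regressor):

x = 1.5
print(regressor.intercept_ + regressor.coef_[0] * x)  # same value as regressor.predict(np.array([[1.5]]))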
In [46]:
# https://www.kdnuggets.com/2019/03/beginners-guide-linear-regression-python-scikit-learn.html
In [47]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as seabornInstance
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics
%matplotlib inline
In [48]:
dataset = pd.read_csv('Weather.csv')
In [49]:
dataset.head()
Out[49]:
In [50]:
dataset.shape
Out[50]:
In [51]:
dataset.info()
In [52]:
dataset.describe()
Out[52]:
Finally, let's plot our data points on a 2-D graph to eyeball the dataset and see whether we can spot any relationship between the two variables, using the script below:
In [53]:
dataset.plot(x='MinTemp', y='MaxTemp', style='o')
plt.title('MinTemp vs MaxTemp')
plt.xlabel('MinTemp')
plt.ylabel('MaxTemp')
plt.show()
Let's check the distribution of the maximum temperature; once we plot it, we can observe that the average maximum temperature falls between roughly 25 and 35.
In [54]:
plt.figure(figsize=(15,10))
plt.tight_layout()
seabornInstance.histplot(dataset['MaxTemp'], kde=True)  # distplot() is deprecated in newer seaborn; histplot(..., kde=True) is its replacement
Out[54]:
Our next step is to divide the data into "attributes" and "labels".
Attributes are the independent variables, while labels are the dependent variables whose values are to be predicted. Our dataset has only two columns, and we want to predict MaxTemp from the recorded MinTemp. Therefore the attribute set consists of the "MinTemp" column, stored in the X variable, and the label is the "MaxTemp" column, stored in the y variable.
In [55]:
X = dataset['MinTemp'].values.reshape(-1,1)
y = dataset['MaxTemp'].values.reshape(-1,1)
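The reshape(-1, 1) matters because scikit-learn expects the feature matrix X to be 2-D with shape (n_samples, n_features); a quick shape check (a minimal sketch):

print(dataset['MinTemp'].values.shape)  # (n_samples,)   - 1-D, which fit() would reject as X
print(X.shape)                          # (n_samples, 1) - 2-D column vector, as required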
Next, we assign 80% of the data to the training set and 20% to the test set using the code below.
The test_size parameter is where we specify the proportion of data held out for the test set.
In [56]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
After splitting the data into training and test sets, it is finally time to train our algorithm. For that, we import the LinearRegression class, instantiate it, and call its fit() method with our training data.
In [57]:
regressor = LinearRegression()
regressor.fit(X_train, y_train) #training the algorithm
Out[57]:
As discussed, the linear regression model finds the values of the intercept and slope that yield the line best fitting the data. To see the intercept and slope calculated by the algorithm for our dataset, execute the following code.
In [58]:
#To retrieve the intercept:
print(regressor.intercept_)
#For retrieving the slope:
print(regressor.coef_)
The result should be approximately 10.66185201 and 0.92033997, respectively.
This means that for every one-unit change in the minimum temperature, the maximum temperature changes by about 0.92 units.
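To make that interpretation concrete, the fitted line can be evaluated by hand (a sketch; because y was reshaped to 2-D, intercept_ and coef_ are arrays here, and 20.0 is an arbitrary example input):

min_temp = 20.0
max_temp_hat = regressor.intercept_[0] + regressor.coef_[0][0] * min_temp
print(max_temp_hat)  # roughly 10.66 + 0.92 * 20, i.e. about 29.07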
Now that we have trained our algorithm, it's time to make some predictions. We will use our test data and see how accurately the algorithm predicts the maximum temperature. To make predictions on the test data, execute the following script:
In [59]:
y_pred = regressor.predict(X_test)
To compare the actual output values for X_test with the predicted values, execute the following script:
In [60]:
df = pd.DataFrame({'Actual': y_test.flatten(), 'Predicted': y_pred.flatten()})
df
Out[60]:
We can also visualize the comparison as a bar graph using the script below.
Note: as the number of records is huge, only the first 25 are shown for readability.
In [61]:
df1 = df.head(25)
df1.plot(kind='bar', figsize=(16,10))
plt.grid(which='major', linestyle='-', linewidth='0.5', color='green')
plt.grid(which='minor', linestyle=':', linewidth='0.5', color='black')
plt.show()
Though our model is not very precise, the predicted values are close to the actual ones.
Let's plot our fitted straight line against the test data:
In [62]:
plt.scatter(X_test, y_test, color='gray')
plt.plot(X_test, y_pred, color='red', linewidth=2)
plt.show()
In [63]:
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
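These three metrics follow directly from their definitions, so they can be cross-checked with plain NumPy (a minimal sketch; y_test and y_pred are both (n_samples, 1) arrays here):

errors = y_test - y_pred
print(np.mean(np.abs(errors)))        # MAE: mean absolute error
print(np.mean(errors ** 2))           # MSE: mean squared error
print(np.sqrt(np.mean(errors ** 2)))  # RMSE: square root of MSE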
In [64]:
# https://www.kdnuggets.com/2019/03/beginners-guide-linear-regression-python-scikit-learn.html/2